Abstract

This paper describes a set of comparative experiments on the problem
of automatically filtering unwanted electronic mail messages. Several
variants of the AdaBoost algorithm with confidence-rated predictions
(Schapire & Singer 99) have been applied, which differ in the
complexity of the base learners considered. Two main conclusions can
be drawn from our experiments: a) the boosting-based methods clearly
outperform the baseline learning algorithms (Naive Bayes and Induction
of Decision Trees) on the PU1 corpus, achieving very high levels of
the $F_1$ measure; b) increasing the complexity of the base learners
makes it possible to obtain better "high-precision" classifiers, which
is a very important issue when misclassification costs are considered.



by using common measures of IR (precision, recall, 
etc.). Another important issue is the relative impor- 
tance between the two types of possible misclassifica- 
tions: While an automated filter that misses a small 
percentage of spam may be acceptable to most users, 
fewer people are likely to accept a filter that incor- 
rectly identifies a small percentage of legitimate mail 
as spam, especially if this implies the automatic discarding
of the misclassified legitimate messages. This
problem suggests the consideration of misclassification 
costs for the learning and evaluation of spam filter sys- 
tems. 

In recent years, a large number of techniques have
been applied to TC, achieving impressive performance
in some cases. Some of the top-performing
methods are Ensembles of Decision Trees (Weiss et al.



1 Introduction



Spam-mail filtering is the problem of automatically filtering
unwanted electronic mail messages. "Spam mail" is also commonly
referred to as "junk mail" or "unsolicited commercial mail".
Nowadays, the problem has a big impact, since bulk emailers take
advantage of the great popularity of the electronic mail
communication channel to indiscriminately flood email accounts with
unwanted advertisements. The major factors that contribute to the
proliferation of unsolicited spam email are the following
two: 1) bulk email is inexpensive to send, and



99), Support Vector Machines (Joachims 98), Boosting



(Schapire & Singer 00) and Instance-based Learning
(Yang & Liu 99). Spam filtering has also been treated



as a particular case of TC. In (Cohen 96) a method
based on TF-IDF weighting and the rule learning
algorithm RIPPER is used to classify and filter email.
(Sahami et al. 98) used the Naive Bayes approach to
filter spam email. (Drucker et al. 99) compared Support
Vector Machines (SVM), boosting of C4.5 trees,
RIPPER and Rocchio, concluding that SVMs and
boosting are the top-performing methods and suggesting
that SVMs are slightly better in distinguishing the



2) pseudonyms are inexpensive to obtain (Cranor &



two types of misclassification. In (Androutsopoulos et



LaMacchia 98). On the contrary, individuals may
waste a large amount of time transferring unwanted



al. 00a) Sahami's Naive Bayes is compared against the
TiMBL Memory-based learner. In (Androutsopoulos



et al. 00b) the same authors present a new public data 



messages to their computers and sorting through those 
messages once transferred, to the point that they may 
be likely to become overwhelmed by spam. 

Automatic IR methods are well suited for addressing
this problem, since spam messages can be distinguished
from "legitimate" email messages by their particular form,
vocabulary and word patterns, which can be found in the
header or body of the messages.

The spam filtering problem can be seen as a par- 
ticular instance of the Text Categorization problem 
(TC), in which only two classes are possible: spam 
and legitimate. However, since one is the opposite of 
the other, it also can be seen as the problem of iden- 
tifying a single class, spam. In this way, the evalua- 
tion of automatic spam filtering systems can be done 



set which might become a standard benchmark corpus, 
and introduce cost-sensitive evaluation measures. 

In this paper, we show that the AdaBoost algorithm
with confidence-rated predictions is very well suited
for addressing the spam filtering problem.
We have obtained very accurate classifiers on the PU1
corpus, and we have observed that the algorithm is
very robust to overfitting. Another advantage of using
AdaBoost is that no prior feature filtering is needed,
since it is able to efficiently manage large feature sets
(of tens of thousands of features).

In the second part of the paper we show how in- 
creasing the expressiveness of the base learners can 
be exploited for obtaining the "high-precision" filters 
that are needed for real user applications. We have 
evaluated the results of such filters using the measures 



introduced in (Androutsopoulos et al. 00b), which
take into account the misclassification costs. We have
substantially improved on the results reported in that
work.

The paper is organized as follows: Section 2 explains
the AdaBoost learning algorithm and the variants used in
the comparative experiments. In Section 3 the setting is
presented in detail, including the corpus and the
experimental methodology used. Section 4 reports the
experiments carried out and the results obtained. Finally,
Section 5 concludes and outlines some directions for
further research.

2 The AdaBoost Algorithm 



In this section the AdaBoost algorithm (Schapire &
Singer 99) is described, restricted to the case of binary
classification.

The purpose of boosting is to find a highly accurate
classification rule by combining many weak rules (or
weak hypotheses), each of which may be only moderately
accurate. A separate procedure, called the WeakLearner,
is assumed to exist for acquiring the weak hypotheses.
The boosting algorithm finds a set of weak hypotheses by
calling the weak learner repeatedly in a series of T rounds.
These weak hypotheses are then linearly combined into a
single rule called the combined hypothesis.

Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the set of $m$
training examples, where each instance $x_i$ belongs to
an instance space $\mathcal{X}$ and $y_i \in \{-1, +1\}$ is the class or
label associated with $x_i$. The goal of learning is to
produce a function of the form $f : \mathcal{X} \to \mathbb{R}$, such that,
for any example $x$, the sign of $f(x)$ is interpreted as
the predicted class ($-1$ or $+1$), and the magnitude
$|f(x)|$ is interpreted as a measure of confidence in the
prediction. Such a function can be used either for
classifying or for ranking new unseen examples.

The pseudo-code of AdaBoost is presented in Figure 1.
It maintains a vector of weights as a distribution
$D_t$ over examples. The goal of the WeakLearner algorithm
is to find a weak hypothesis with moderately low
error with respect to these weights. Initially, the
distribution $D_1$ is uniform, but the boosting algorithm
exponentially updates the weights on each round to
force the weak learner to concentrate on the examples
which are hardest to predict by the preceding hypotheses.

More precisely, let $D_t$ be the distribution at round $t$,
and $h_t : \mathcal{X} \to \mathbb{R}$ the weak rule acquired according to
$D_t$. In this setting, weak hypotheses $h_t(x)$ also make
real-valued confidence-rated predictions (i.e., the sign
of $h_t(x)$ is the predicted class, and $|h_t(x)|$ is
interpreted as a measure of confidence in the prediction).
A parameter $\alpha_t$ is then chosen and the distribution
$D_t$ is updated. The choice of $\alpha_t$ will be determined
by the type of weak learner (see next section). In the
typical case that $\alpha_t$ is positive, the updating function
decreases (or increases) the weights $D_t(i)$ for which $h_t$



procedure AdaBoost (in: S = {(x_i, y_i)}, i = 1..m)
    ### S is the set of training examples
    ### Initialize distribution D_1 (for all i, 1 <= i <= m)
    D_1(i) = 1/m
    for t = 1 to T do
        ### Get the weak hypothesis h_t : X -> R
        h_t = WeakLearner(X, D_t);
        Choose alpha_t in R;
        ### Update distribution D_t (for all i, 1 <= i <= m)
        D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        ### Z_t is chosen so that D_{t+1} will be a distribution
    end-for
    return the combined hypothesis: f(x) = sum_{t=1..T} alpha_t h_t(x)
end AdaBoost

Figure 1: The AdaBoost algorithm 

makes a good (or bad) prediction, and this variation is
proportional to the confidence $|h_t(x_i)|$. The final
hypothesis, $f$, computes its predictions using a weighted
vote of the weak hypotheses.

In (Schapire & Singer 99) it is proven that the training
error of the AdaBoost algorithm (i.e., the fraction of
training examples $i$ for which the sign of $f(x_i)$ differs
from $y_i$) is at most $\prod_{t=1}^{T} Z_t$, where
$Z_t$ is the normalization factor computed on round $t$.
This upper bound is used in guiding the design of
both the parameter $\alpha_t$ and the WeakLearner algorithm,
which attempts to find a weak hypothesis $h_t$ that
minimizes $Z_t$.
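The loop of Figure 1 and the bound on the training error can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `adaboost` and `sign_rule_learner`, the six-point toy data set and the choice of $\alpha_t$ for rules with outputs in $\{-1, +1\}$ are all assumptions made for illustration only.

```python
import math

def adaboost(xs, ys, weak_learner, rounds):
    """Generic loop of Figure 1 for labels in {-1, +1}.

    weak_learner(xs, ys, dist) is assumed to return (h, alpha), where h
    maps an instance to a real number. Returns f and the list of Z_t.
    """
    m = len(xs)
    dist = [1.0 / m] * m                       # D_1 is uniform
    rules, zs = [], []
    for _ in range(rounds):
        h, alpha = weak_learner(xs, ys, dist)
        # Unnormalised update: D_t(i) * exp(-alpha_t * y_i * h_t(x_i))
        raw = [d * math.exp(-alpha * y * h(x)) for d, x, y in zip(dist, xs, ys)]
        z = sum(raw)                           # normalisation factor Z_t
        dist = [w / z for w in raw]
        rules.append((alpha, h))
        zs.append(z)

    def f(x):
        return sum(alpha * h(x) for alpha, h in rules)
    return f, zs

def sign_rule_learner(xs, ys, dist):
    """Hypothetical toy weak learner over rules 'x < thr -> s, else -s'."""
    best = None
    for thr in sorted(set(xs)) + [max(xs) + 1]:
        for s in (+1, -1):
            h = lambda x, thr=thr, s=s: s if x < thr else -s
            r = sum(d * y * h(x) for d, x, y in zip(dist, xs, ys))
            if best is None or r > best[0]:
                best = (r, h)
    r, h = best
    r = max(min(r, 1 - 1e-9), -1 + 1e-9)       # keep the logarithm finite
    return h, 0.5 * math.log((1 + r) / (1 - r))

xs = [1, 2, 3, 4, 5, 6]
ys = [+1, +1, -1, -1, +1, +1]                  # no single threshold separates these
f, zs = adaboost(xs, ys, sign_rule_learner, rounds=10)
train_error = sum((f(x) > 0) != (y > 0) for x, y in zip(xs, ys)) / len(xs)
bound = 1.0
for z in zs:
    bound *= z                                 # product of the Z_t values
```

Here `train_error` never exceeds `bound`, which is the inequality stated above; the weak learner is passed in as a function so that stumps or deeper trees can be plugged in without touching the loop.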

2.1 Learning weak rules 

In (Schapire & Singer 99) three different variants of
AdaBoost.MH are defined, corresponding to three different
methods for choosing the $\alpha_t$ values and calculating
the predictions of the weak hypotheses. In
this work we concentrate on AdaBoost with real-valued
predictions, since it is the one that has achieved the best
results in the Text Categorization domain (Schapire &
Singer 00).



According to this setting, weak hypotheses are simple
rules with real-valued predictions. Such simple
rules test the value of a boolean predicate and make
a prediction based on that value. The predicates used
refer to the presence of a certain word in the text, e.g.
"the word money appears in the message". Formally,
based on a given predicate $p$, our interest lies in weak
hypotheses $h$ which make predictions of the form:

$$h(x) = \begin{cases} c_1 & \text{if } p \text{ holds in } x \\ c_0 & \text{otherwise} \end{cases}$$

where $c_0$ and $c_1$ are real numbers.



For a given predicate $p$, the values $c_0$ and $c_1$ are
calculated as follows. Let $X_1$ be the subset of examples
for which the predicate $p$ holds and let $X_0$ be the
subset of examples for which the predicate $p$ does not
hold. Let $[\pi]$, for any predicate $\pi$, be 1 if $\pi$ holds and
0 otherwise. Given the current distribution $D_t$, the
following real numbers are calculated for $j \in \{0, 1\}$
and for $b \in \{+1, -1\}$:

$$W_b^j = \sum_{i=1}^{m} D_t(i) \, [\, x_i \in X_j \wedge y_i = b \,]$$

That is, $W_b^j$ is the weight, with respect to the
distribution $D_t$, of the training examples in partition $X_j$
which are of class $b$. As shown in (Schapire &
Singer 99), $Z_t$ is minimized for a particular predicate
by choosing:

$$c_j = \frac{1}{2} \ln \left( \frac{W_{+1}^j}{W_{-1}^j} \right) \qquad (1)$$

and by setting $\alpha_t = 1$. These settings imply that:

$$Z_t = 2 \sum_{j \in \{0,1\}} \sqrt{W_{+1}^j W_{-1}^j} \qquad (2)$$

Thus, the predicate $p$ chosen is that for which the value
of $Z_t$ is smallest.

Very small or zero values for the parameters $W_b^j$
cause the predictions $c_j$ to be large or infinite in
magnitude. In practice, such large predictions may
cause numerical problems for the algorithm, and seem
to increase the tendency to overfit. As suggested
in (Schapire & Singer 00), smoothed values for
$c_j$ have been considered.
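As a rough illustration of how a stump could be selected, the sketch below computes the four weights of each candidate word, scores the word with equation (2) and produces smoothed predictions in the spirit of equation (1). The function name `best_stump`, the toy messages and the smoothing constant `eps` (set here to $1/m$) are assumptions for the example; the text does not state the exact smoothing used.

```python
import math

def best_stump(docs, ys, dist, vocabulary, eps):
    """Pick the word whose presence test minimises Z (equation 2).

    docs: list of sets of words; ys: labels in {-1, +1} (+1 = spam);
    dist: current weights D_t; eps: smoothing constant (an assumption).
    Returns (word, c0, c1, Z), with c1 used when the word is present.
    """
    best = None
    for word in vocabulary:
        # w[j][b] is the weight of the examples in partition X_j with class b
        w = {0: {+1: 0.0, -1: 0.0}, 1: {+1: 0.0, -1: 0.0}}
        for doc, y, d in zip(docs, ys, dist):
            w[1 if word in doc else 0][y] += d
        z = 2 * sum(math.sqrt(w[j][+1] * w[j][-1]) for j in (0, 1))
        if best is None or z < best[3]:
            # Smoothed version of equation (1)
            c = [0.5 * math.log((w[j][+1] + eps) / (w[j][-1] + eps)) for j in (0, 1)]
            best = (word, c[0], c[1], z)
    return best

docs = [{"money", "free"}, {"free", "offer"}, {"meeting", "money"}, {"meeting", "report"}]
ys = [+1, +1, -1, -1]
dist = [0.25] * 4
vocabulary = sorted(set().union(*docs))
word, c0, c1, z = best_stump(docs, ys, dist, vocabulary, eps=1.0 / len(docs))
```

On this toy data the word "free" splits the two classes perfectly, so its $Z$ is zero; without smoothing its predictions would be infinite, which is the numerical problem the smoothing avoids.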

It is important to see that the weak rules presented so
far can be directly seen as decision trees with a single
internal node (which tests the value of a boolean
predicate) and two leaf nodes that give the real-valued
predictions for the two possible outcomes of the test.
These extremely simple decision trees are sometimes
called decision stumps. In turn, the boolean predicates
can be seen as binary features (we will use the word
feature instead of predicate from now on); thus, the
criterion already described for finding the best weak rule
(or the best feature) can be seen as a natural splitting
criterion and used for performing decision-tree
induction (Schapire & Singer 99).



Following the idea suggested in (Schapire & Singer
99), we have extended the WeakLearner algorithm to
induce arbitrarily deep decision trees. The splitting
criterion used consists in choosing the feature that
minimizes equation (2), while the predictions at the
leaves of the boosted trees are given by equation (1).
Note that the general AdaBoost procedure remains
unchanged.

In this paper, we will denote as TreeBoost the
AdaBoost.MH algorithm including the extended
WeakLearner. TreeBoost[d] will stand for a learned
classifier with weak rules of depth d. As a special case,
TreeBoost[0] will be denoted as Stumps.
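One possible reading of the extended WeakLearner is sketched below: a tree is grown recursively, each node picks the word that minimises the sum of equation (2) restricted to the examples reaching that node, and each leaf predicts with a smoothed equation (1). Applying the criterion locally at each node, the smoothing constant `eps`, and the names `grow` and `predict` are assumptions; depth 0 yields a stump, matching the naming above.

```python
import math

def grow(docs, ys, dist, idx, words, depth, eps):
    """Return a float leaf or a tuple (word, absent_branch, present_branch)."""
    pos = sum(dist[i] for i in idx if ys[i] > 0)
    neg = sum(dist[i] for i in idx if ys[i] < 0)
    leaf = 0.5 * math.log((pos + eps) / (neg + eps))   # smoothed equation (1)
    if depth < 0 or pos == 0 or neg == 0:
        return leaf

    def z_of(word):
        # Equation (2), restricted to the examples that reach this node
        z = 0.0
        for present in (False, True):
            part = [i for i in idx if (word in docs[i]) == present]
            wp = sum(dist[i] for i in part if ys[i] > 0)
            wn = sum(dist[i] for i in part if ys[i] < 0)
            z += 2 * math.sqrt(wp * wn)
        return z

    word = min(words, key=z_of)
    yes = [i for i in idx if word in docs[i]]
    no = [i for i in idx if word not in docs[i]]
    if not yes or not no:
        return leaf
    return (word,
            grow(docs, ys, dist, no, words, depth - 1, eps),
            grow(docs, ys, dist, yes, words, depth - 1, eps))

def predict(tree, doc):
    while isinstance(tree, tuple):
        word, absent_branch, present_branch = tree
        tree = present_branch if word in doc else absent_branch
    return tree

# Toy data: spam (+1) iff exactly one of the two words occurs.
docs = [{"free"}, {"money"}, {"free", "money"}, set()]
ys = [+1, +1, -1, -1]
dist = [0.25] * 4
words = ["free", "money"]
stump = grow(docs, ys, dist, range(4), words, depth=0, eps=0.25)
tree1 = grow(docs, ys, dist, range(4), words, depth=1, eps=0.25)
```

On this XOR-like toy set the stump abstains (every prediction is 0), while the depth-1 tree classifies all four messages correctly, which illustrates the extra expressiveness of deeper weak rules.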



3 Setting 

3.1 Domain of Application 

We have evaluated our system on the PU1 benchmark
corpus¹ for the anti-spam email filtering problem. It
consists of 1,099 messages: 481 of them are spam and
the remaining 618 are legitimate. The corpus is
presented partitioned into 10 folds of the same size which
maintain the distribution of spam messages
(Androutsopoulos et al. 00b). All our experiments have been
performed using 10-fold cross-validation.

The feature set of the corpus is a bag-of-words
model. Four versions are available: with or without
stemming, and with or without stop-word removal.
The experiments reported in this paper have
been performed using the version with neither stemming
nor stop-word removal, which consists of a set of 26,449
features.

3.2 Experimental Methodology 

Evaluation Measures. Measures for evaluating the
spam filtering system are introduced here. Let $S$ and $L$
be the number of spam and legitimate messages in the
corpus, respectively; let $S_+$ denote the number of spam
messages that are correctly classified by a system, and
$S_-$ the number of spam messages misclassified as
legitimate. In the same way, let $L_+$ and $L_-$ be the number
of legitimate messages classified by a system as spam
and legitimate, respectively. These four values form
a contingency table which summarizes the behaviour
of a system. The widely used measures precision ($p$),
recall ($r$) and $F_\beta$ are defined as follows:



$$p = \frac{S_+}{S_+ + L_+} \qquad r = \frac{S_+}{S_+ + S_-} \qquad F_\beta = \frac{(\beta^2 + 1)\, p\, r}{\beta^2 p + r}$$

The $F_\beta$ measure combines precision and recall, and
with $\beta = 1$ gives equal weight to the combined
measures. Additionally, some experiments in the paper
will also consider the accuracy measure
($acc = \frac{L_- + S_+}{L + S}$).
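The measures above follow directly from the four cells of the contingency table. The short Python sketch below shows the computation; the function name `filter_measures` and the example counts are hypothetical (the counts were only chosen so that they add up to the 481 spam and 618 legitimate messages of the corpus).

```python
def filter_measures(s_plus, s_minus, l_plus, l_minus, beta=1.0):
    """Precision, recall, F_beta and accuracy from the contingency table.

    s_plus:  spam classified as spam        s_minus: spam classified as legitimate
    l_plus:  legitimate classified as spam  l_minus: legitimate classified as legitimate
    Assumes at least one message is classified as spam and one is spam.
    """
    precision = s_plus / (s_plus + l_plus)
    recall = s_plus / (s_plus + s_minus)
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (l_minus + s_plus) / (s_plus + s_minus + l_plus + l_minus)
    return precision, recall, f_beta, accuracy

# Hypothetical counts: 481 spam and 618 legitimate messages in total.
p, r, f1, acc = filter_measures(s_plus=450, s_minus=31, l_plus=10, l_minus=608)
```

With $\beta = 1$ the value reduces to the harmonic mean of precision and recall, which can also be written as $2S_+ / (2S_+ + S_- + L_+)$.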

A way to distinguish the two types of misclassification
is the use of utility measures (Lewis 95),
used in the TREC evaluations (Robertson & Hull
01). In this general measure, positions in the
contingency table are associated with loss values, $\lambda_{S_+}$, $\lambda_{S_-}$,
$\lambda_{L_+}$, $\lambda_{L_-}$, which indicate how desirable the outcomes
are, according to a user-defined scenario. The overall
performance of a system in terms of utility is
$S_+ \lambda_{S_+} + S_- \lambda_{S_-} + L_+ \lambda_{L_+} + L_- \lambda_{L_-}$.



Androutsopoulos et al. (Androutsopoulos et al.
00b) propose particular scenarios in which misclassifying
a legitimate message as spam is $\lambda$ times more
costly than the symmetric misclassification. In terms
of utility, these scenarios can be translated to $\lambda_{S_+} = 0$,
$\lambda_{S_-} = -1$, $\lambda_{L_+} = -\lambda$ and $\lambda_{L_-} = 0$. They also
introduce the weighted accuracy (WAcc) measure, a version

¹ The PU1 Corpus is freely available from the publications
section of http://www.iit.demokritos.gr/~ionandr



of accuracy sensitive to $\lambda$-cost:

$$WAcc = \frac{\lambda \cdot L_- + S_+}{\lambda \cdot L + S}$$



When evaluating filtering systems, this measure suffers
from the same problems as standard accuracy
(Yang 99). Despite this fact, we will use it for
comparison purposes.
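Both cost-sensitive measures can be computed from the same contingency counts. The following Python sketch uses hypothetical function names (`weighted_accuracy`, `utility`) and hypothetical counts; the utility function hard-codes the loss values of the scenarios described above (0, -1, -λ, 0).

```python
def weighted_accuracy(s_plus, s_minus, l_plus, l_minus, lam):
    """WAcc: each legitimate message counts as lam messages."""
    n_spam = s_plus + s_minus
    n_legit = l_plus + l_minus
    return (lam * l_minus + s_plus) / (lam * n_legit + n_spam)

def utility(s_plus, s_minus, l_plus, l_minus, lam):
    """Utility with losses 0 (S+), -1 (S-), -lam (L+) and 0 (L-)."""
    return 0 * s_plus - 1 * s_minus - lam * l_plus + 0 * l_minus

# Hypothetical counts: 481 spam and 618 legitimate messages in total.
counts = dict(s_plus=450, s_minus=31, l_plus=10, l_minus=608)
wacc_1 = weighted_accuracy(lam=1, **counts)   # equals plain accuracy
wacc_9 = weighted_accuracy(lam=9, **counts)
u_9 = utility(lam=9, **counts)
```

With $\lambda = 1$ the weighted accuracy coincides with standard accuracy, which is why the $\lambda = 1$ scenario corresponds to "no cost considered".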

3.3 Baseline Algorithms 

In order to compare our boosting methods against
other techniques, we include the following two
baseline methods:

• Decision Trees. Standard TDIDT learning algorithm,
using the RLM distance-based function for
feature selection. See (Marquez 99) for complete
details about the particular implementation.

• Naive Bayes. We include the best results on the
PU1 Corpus reported in (Androutsopoulos et al.
00b), corresponding to a Naive Bayes classifier.



4 Experiments 

This section explains the set of experiments carried
out. As said in Section 3, all experiments work with
the PU1 Corpus.

4.1 Comparing methods on the corpus 

The purpose of our first experiment is to show the
general performance of boosting methods in the
spam-filtering domain. Six AdaBoost classifiers have been
learned, setting the depth of the weak rules from 0 to
5; we denote each classifier as TreeBoost[d], where d
stands for the depth of the weak rules; as a particular
case, we denote the TreeBoost[0] classifier as Stumps.
Each version of TreeBoost has been learned for up to
2,500 weak rules.

Figure 2 shows the $F_1$ measure of each classifier
as a function of the number of rounds used. The
plot also includes the rates obtained by the baseline
algorithms. It can be seen that TreeBoost clearly
outperforms the baseline algorithms. The experiment
also shows that, above a certain number of rounds,
all TreeBoost versions achieve consistently good results,
and that there is no overfitting in the process.
After 150 rounds of boosting, all versions reach an $F_1$
value above 97%. It can be noticed that the deeper the
weak rules, the smaller the number of rounds needed to
achieve good performance. This is not surprising, since
deeper weak rules handle much more information.
Additionally, the figure shows that different numbers of
rounds produce slight variations in the error rate.

A concrete value for the T parameter of the Tree- 
Boost learning algorithm must be given, in order to 
obtain real classifiers and to be able to make compar- 
isons between the different versions of TreeBoost and 
baseline methods. To our knowledge, it is still unclear 
Figure 2: $F_1$ measure of Stumps and TreeBoost[d], for
increasing number of rounds



what is the best way of choosing T. We have estimated
the T parameter on a validation set built for the
task, with the following procedure: a) For each trial
in the cross-validation experiment, 8 of the 9 training
subsets are used to learn up to 2,500 rounds of boosting,
and the remaining one is used as the validation set
for testing the classifier with respect to the number of
rounds, in steps of 25. b) The outputs of all classifiers
are used to compute the $F_1$ measure. c) The minimum
T for which $F_1$ is maximum is chosen as the
estimated optimal number of rounds for all classifiers.
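Steps b) and c) of this procedure can be sketched in a few lines of Python. The sketch assumes that "the outputs of all classifiers" means pooling the validation counts of all cross-validation trials before computing $F_1$; the function name `choose_rounds` and the counts in the example are hypothetical.

```python
def choose_rounds(counts_by_rounds):
    """Return the smallest T whose pooled validation F1 is maximal.

    counts_by_rounds[T] is a list with one (s_plus, s_minus, l_plus)
    tuple per cross-validation trial, measured on that trial's
    validation fold after T rounds of boosting.
    """
    def pooled_f1(trials):
        s_plus = sum(t[0] for t in trials)
        s_minus = sum(t[1] for t in trials)
        l_plus = sum(t[2] for t in trials)
        denom = 2 * s_plus + s_minus + l_plus
        return 2 * s_plus / denom if denom else 0.0

    scores = {t: pooled_f1(trials) for t, trials in counts_by_rounds.items()}
    best = max(scores.values())
    return min(t for t, score in scores.items() if score == best)

# Hypothetical validation counts for two trials, evaluated every 25 rounds.
counts = {
    25: [(40, 8, 3), (41, 7, 2)],
    50: [(45, 3, 1), (44, 4, 2)],
    75: [(44, 4, 1), (45, 3, 2)],
}
chosen = choose_rounds(counts)
```

In this example T = 50 and T = 75 reach the same pooled $F_1$, so the smaller value is kept, following step c).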

Table 1 presents the results of all classifiers. For
each one, we include the number of rounds (estimated
on the validation set), recall, precision, $F_1$ and the
maximum $F_1$ achieved over the 2,500 rounds learned.
According to the results, boosting classifiers clearly
outperform the other algorithms. Only Naive Bayes
achieves a precision (95.11%) slightly lower than that
obtained by boosting classifiers (the worst is 97.48%);
however, its recall at this point is much lower.





Method          T      Recall   Prec.    F1       F1 max
N. Bayes        -      83.98    95.11    89.19    -
D. Trees        -      89.81    88.71    89.25    -
Stumps          525    96.47    97.48    96.97    97.39
TreeBoost[1]    525    96.88    97.90    97.39    97.60
TreeBoost[2]    725    96.67    98.31    97.48    97.59
TreeBoost[3]    675    96.88    97.90    97.39    97.81
TreeBoost[4]    450    97.09    98.73    97.90    98.01
TreeBoost[5]    550    96.88    98.52    97.69    98.12

Table 1: Performance of all classifiers

Accuracy results have been compared using the 10-fold
cross-validated paired t test. Boosting classifiers
perform significantly better than Decision Trees². On
the contrary, no significant differences can be observed



² Since we do not own the Naive Bayes classifiers, no
tests have been run; but presumably boosting methods are
also significantly better.



between the different versions of TreeBoost. More
interestingly, it can be noticed that accuracy and
precision rates slightly increase with the expressiveness of
the weak rules, and that this improvement does not
affect the recall rate. This fact will be exploited in the
following experiments.

4.2 High-Precision classifiers 

This section is devoted to evaluating TreeBoost in
high-precision scenarios, where only a very low (or null)
proportion of legitimate-to-spam misclassifications is
allowed.

Rejection Curves. We start by evaluating whether the
confidence of a prediction, i.e., the magnitude of the
prediction, is a good indicator of the quality of the
prediction. For this purpose, rejection curves are
computed for each classifier. The procedure to compute a
rejection curve is the following: for several points $p$
between 0 and 100, reject the $p$% of the predictions whose
confidence scores are lowest, whether positive or negative,
and compute the accuracy of the remaining $(100 - p)$%
predictions. This results in higher accuracy values as
$p$ increases. Figure 3 plots the rejection curves
computed for the six learned classifiers. The following
conclusions can be drawn:

• The confidence of a prediction is a good indicator
of its quality.

• The depth of the weak rules greatly improves the
quality of the predictions. Whereas Stumps needs
to reject the 73% least confident examples
to achieve 100% accuracy, TreeBoost[5] only
needs to reject 23%. In other words, deeper TreeBoost
filters concentrate the misclassified examples closer
to the decision threshold.

• The previous fact has important consequences for
a potential final email filtering application, with
the following specification: messages whose
prediction confidence is greater than a threshold $\tau$
are automatically classified: spam messages are
blocked and legitimate messages are delivered to
the user. Messages whose prediction confidence is
lower than $\tau$ are stored in a special folder for
dubious messages. The user has to verify whether these
are legitimate messages. This specification is suitable
for having automatic filters with different degrees
of strictness (i.e., different values for the $\tau$
parameter). $\tau$ values could be tuned using a validation
set.
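The rejection-curve procedure and the three-way filter described in the last bullet can be sketched as follows. The names `rejection_curve` and `route`, the folder labels and the scores in the example are hypothetical; positive scores are read as spam, as in the rest of the paper.

```python
def rejection_curve(scores, labels, percents):
    """Accuracy on the predictions kept after rejecting the p% least confident.

    scores: real-valued predictions f(x); labels: true classes in {-1, +1};
    percents: integer values of p. Accuracy is None when nothing is kept.
    """
    ranked = sorted(zip(scores, labels), key=lambda sl: abs(sl[0]), reverse=True)
    curve = []
    for p in percents:
        keep = ranked[: len(ranked) - (len(ranked) * p) // 100]
        correct = sum((s > 0) == (y > 0) for s, y in keep)
        curve.append((p, correct / len(keep) if keep else None))
    return curve

def route(score, tau):
    """Three-way filter: block, deliver, or hold for manual inspection."""
    if abs(score) <= tau:
        return "dubious"
    return "block" if score > 0 else "deliver"

# Hypothetical predictions: the three errors have the lowest confidence.
scores = [2.5, -1.8, 0.2, -0.1, 1.1, -3.0, 0.4, -0.6, 0.05, 1.9]
labels = [1, -1, -1, 1, 1, -1, 1, -1, -1, 1]
curve = rejection_curve(scores, labels, [0, 20, 30])
```

In this toy case rejecting the 30% least confident predictions is enough to reach 100% accuracy, which mirrors the behaviour reported for the deeper TreeBoost classifiers.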

Cost-Sensitive Evaluation. In this section, TreeBoost
classifiers are evaluated using the $\lambda$-cost measures
introduced in Section 3.2. Three scenarios of strictness
are presented in (Androutsopoulos et al. 00b): a)
No cost considered, corresponding to $\lambda = 1$; b)
Semi-automatic scenario, for a moderately accurate filter,
giving $\lambda = 9$; and c) Completely automatic scenario,
Figure 3: Rejection curves for all TreeBoost classifiers.
X axis: percentage of rejected predictions; y axis:
accuracy.



for a very highly accurate filter, assigning $\lambda = 999$. As
noted in Section 3.2, we will consider these scenarios as
particular utility matrices.



In (Schapire et al. 98) a modification of the AdaBoost
algorithm for handling general utility matrices
is presented. The idea is to initialize the weight
distribution of examples according to the given utility
matrix, and then run the learning algorithm as usual.
We have performed experiments with this setting, but
the results are not convincing: only the initial rounds
of boosting are affected by the initialization based on
utility; after a number of rounds, the performance
seems to be as if no utility had been considered. Since
our procedure for tuning the number of rounds cannot
determine when the initial stage ends, we have rejected
this approach. We think that the modification of the
AdaBoost algorithm should also consider the weight
update.

Another approach consists in adjusting the decision
threshold $\theta$. In a default scenario, corresponding to
$\lambda = 1$, an example is classified as spam if its
prediction is greater than 0; in this case, $\theta = 0$. Increasing
the value of $\theta$ results in a higher-precision classifier.



(Lewis 95) presented a procedure for calculating the
optimal decision threshold for a system, given an
arbitrary utility matrix. The procedure is valid only
when the system outputs probabilities, so the prediction
scores resulting from the boosting classifications
should be mapped to probabilities. A method for
estimating probabilities given the output of AdaBoost is
suggested in (Friedman et al. 00), using a logistic
function. Initial experiments with this function have not
worked properly, because relatively low predictions are
sent to extreme probability values. A possible solution
would be to scale down the predictions before applying
the probability estimate; however, it can be observed
that prediction scores grow with both the number and
the depth of the weak rules used. Since many
parameters are involved in this scaling, we have rejected the
probability estimation of predictions.
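The saturation effect mentioned above is easy to reproduce. The sketch below assumes the logistic link usually associated with AdaBoost outputs, $P(\text{spam} \mid x) = 1 / (1 + e^{-2 f(x)})$; the function name `score_to_probability` and the `scale` parameter (a stand-in for the "scaling down" discussed in the text) are illustrative assumptions, not the authors' code.

```python
import math

def score_to_probability(score, scale=1.0):
    """Logistic link p = 1 / (1 + exp(-2 * scale * score)).

    Written with tanh, which is algebraically identical and does not
    overflow for large negative scores.
    """
    return 0.5 * (1.0 + math.tanh(scale * score))

p_zero = score_to_probability(0.0)          # no confidence either way
p_mid = score_to_probability(3.0)           # already above 0.99
p_high = score_to_probability(10.0)         # indistinguishable from 1
p_scaled = score_to_probability(10.0, 0.1)  # scaled-down score
```

A combined score of 3, which is small when hundreds of weak rules vote, is already mapped above 0.99, so thresholds on the probability lose resolution unless the score is rescaled; the right scale, however, depends on the number and depth of the weak rules.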

Alternatively, we make our classification scheme
sensitive to the $\lambda$ factor by tuning the $\theta$ parameter to
the value which maximizes the weighted accuracy
measure. Once more, the concrete value for $\theta$ is obtained
using a validation set, in which several values for the
parameter are tested. Table 2 summarizes the results
obtained from such procedures, giving $\lambda$ factor values
of 9 and 999. Results obtained in (Androutsopoulos et
al. 00b) are also reported.
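The threshold tuning can be sketched as a simple search over candidate values on the validation set. The function name `tune_threshold`, the candidate grid, the tie-breaking rule (prefer the smallest threshold) and the scores below are assumptions for illustration.

```python
def tune_threshold(scores, labels, lam, candidates):
    """Return the candidate theta that maximises WAcc on validation data.

    A message is classified as spam when its score exceeds theta.
    Ties are broken in favour of the smallest theta (an assumption).
    """
    n_spam = sum(1 for y in labels if y > 0)
    n_legit = len(labels) - n_spam

    def wacc(theta):
        s_plus = sum(1 for s, y in zip(scores, labels) if y > 0 and s > theta)
        l_minus = sum(1 for s, y in zip(scores, labels) if y < 0 and s <= theta)
        return (lam * l_minus + s_plus) / (lam * n_legit + n_spam)

    return max(candidates, key=lambda theta: (wacc(theta), -theta))

# Hypothetical validation scores: one legitimate message scores 1.5.
scores = [3.0, 2.0, 1.0, 0.5, 1.5, -0.5, -1.0, -2.0]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
candidates = [0.0, 0.75, 1.75]
theta_1 = tune_threshold(scores, labels, lam=1, candidates=candidates)
theta_9 = tune_threshold(scores, labels, lam=9, candidates=candidates)
```

With $\lambda = 1$ the default threshold 0 is kept, while with $\lambda = 9$ the search moves the threshold above the misclassified legitimate message, trading recall for precision.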

Again, TreeBoost clearly outperforms the baseline
methods. With $\lambda = 9$, very high precision rates are
achieved, maintaining considerably high recall rates;
it seems that the depth of TreeBoost slightly improves
the performance, although no significant differences
can be observed. For $\lambda = 999$, precision rates of
100% (which is the implicit goal in this scenario) are
achieved, except for Stumps, maintaining fair levels
of recall. However, recall rates are slightly unstable
with respect to the depth of TreeBoost, varying from
64.45% to 76.30%. Our impression is that high values
of the $\lambda$ factor introduce instability in
the evaluation, which becomes oversensitive to
outliers. In this particular corpus (which contains 1,099
examples), weighted accuracy does not seem to work
properly with $\lambda$ values of 999, since the
misclassification of only one legitimate message leads to a
score worse than if no email had been filtered (this
would give WAcc = 99.92%). Moreover, for 100%
precision values, the recall variation from 0% to 100%
only affects the measure by 0.08 units.
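The figures quoted in this paragraph can be checked with the corpus sizes ($L = 618$, $S = 481$) and $\lambda = 999$. The following worked arithmetic is added for illustration; the second case (every spam message blocked, exactly one legitimate message misclassified) is a hypothetical best case for a filter that makes one such error.

$$WAcc_{\text{no filter}} = \frac{\lambda L}{\lambda L + S} = \frac{999 \cdot 618}{999 \cdot 618 + 481} = \frac{617382}{617863} \approx 99.92\%$$

$$WAcc_{\text{one error}} = \frac{\lambda (L - 1) + S}{\lambda L + S} = \frac{999 \cdot 617 + 481}{617863} = \frac{616864}{617863} \approx 99.84\%$$

So a single misclassified legitimate message costs about 0.16 points, whereas moving recall from 0% to 100% at 100% precision gains only $481 / 617863 \approx 0.08$ points.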

In order to give a clearer picture of the behaviour
of the classifiers when moving the decision threshold, we
include in Figure 4 the precision-recall curves of each
classifier. These curves are built by giving $\theta$ a wide range
of values, and computing for each value the recall and
precision rates. In these curves, high-precision rates of
100%, 99%, 98% and 95% have been fixed so as to obtain
the recall rate at these points. Table 3 summarizes
these samples. All the variants are indistinguishable
at the 95% precision level. However, when moving to
higher values of precision (> 95%) a significant
difference seems to occur between Stumps and the rest of
the variants, which use deeper weak rules. This fact shows
that increasing the expressiveness of the weak rules can
improve the performance when very high precision
filters are required. Unfortunately, no clear conclusions can
be drawn about the most appropriate depth.
Parenthetically, it can be noted that TreeBoost[4] achieves
the best recall rates in this particular corpus.
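One way to read a recall value off a precision-recall curve at a fixed precision level is sketched below. It sweeps the threshold over the sorted scores and keeps the best recall among the operating points that meet the target precision; the name `recall_at_precision` and the example scores are hypothetical, and the authors' exact sampling procedure may differ.

```python
def recall_at_precision(scores, labels, target):
    """Best recall over all thresholds whose precision is at least target."""
    n_spam = sum(1 for y in labels if y > 0)
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    best = 0.0
    tp = fp = 0
    for i, (s, y) in enumerate(ranked):
        if y > 0:
            tp += 1
        else:
            fp += 1
        # A threshold can only be placed between two distinct scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == s:
            continue
        if tp / (tp + fp) >= target:
            best = max(best, tp / n_spam)
    return best

# Hypothetical scores: one legitimate message is ranked fourth.
scores = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5, -0.5, -1.0]
labels = [1, 1, 1, -1, 1, 1, -1, -1]
r_100 = recall_at_precision(scores, labels, 1.0)
r_90 = recall_at_precision(scores, labels, 0.9)
r_80 = recall_at_precision(scores, labels, 0.8)
```

In the example a single highly scored legitimate message caps recall at 60% for any precision requirement above 5/6, which shows why the 100% precision column is the most sensitive to individual outliers.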



Method          100%     99%      98%      95%
Stumps          62.37    87.94    94.17    98.75
TreeBoost[1]    81.91    91.26    96.88    98.75
TreeBoost[2]    81.49    90.64    97.08    98.54
TreeBoost[3]    77.54    93.13    96.88    98.54
TreeBoost[4]    80.24    96.25    97.71    98.75
TreeBoost[5]    77.75    93.55    97.29    98.75



Table 3: Recall rate of filtered spam messages with 
respect to fixed points of precision rate 
Figure 4: Precision-Recall curves and recall values for
the fixed precision rates at 100%, 99%, 98% and 95%.
X axis: recall; y axis: precision.



5 Conclusions 

The presented experiments show that the AdaBoost learning
algorithm clearly outperforms Decision Trees and
Naive Bayes methods on the public benchmark PU1
Corpus. On this data set, the method is resistant to
overfitting and $F_1$ rates above 97% are achieved.
Procedures for automatically tuning the classifier
parameters, such as the number of boosting rounds, are
provided.
Table 2: Cost-sensitive evaluation results



In scenarios where high-precision classifiers are
required, AdaBoost classifiers have been shown to work
properly. Experiments have exploited the expressiveness
of the weak rules when increasing their depth. It
can be concluded that deeper weak rules tend to be
more suitable when looking for a very high precision
classifier. In this situation, the results achieved on the
PU1 Corpus are fairly satisfactory.

Two capabilities of AdaBoost classifiers have been
shown to be useful in final email filtering applications:
a) the confidence of the predictions suggests a filter
which only blocks the most confident messages,
delivering the remaining messages to the final user; b) the
classification threshold can be tuned to obtain a very
high precision classifier.

As a future research line, we would like to study
the presented techniques on a larger corpus. We think
that the PU1 corpus is too small and also too easy:
default parameters produce very good results, and the
tuning procedures result only in slight improvements.
Moreover, some experiments not reported here (which
study the effect of the number of rounds, the use of
richer feature spaces, etc.) have shown us that the
confidence of classifiers depends on several parameters.
Using a larger corpus, the effectiveness of the tuning
procedures would be more explicit and, hopefully, clear
conclusions about the optimal parameter settings of
AdaBoost could be drawn.

Another line for future research is the introduction
of misclassification costs inside the AdaBoost learning
algorithm. Initial experiments with the method
proposed in (Schapire et al. 98) have not worked properly,
although we believe that directly learning classifiers
according to some utility settings will perform better
than tuning a classifier once learned.